(57) ABSTRACT

A system for adaptively generating a composite noisy 
speech model to process speech in, e.g., a nonstationary 
environment comprises a speech recognizer, a re-estimation
circuit, a combiner circuit, a classifier circuit, and a dis- 
crimination circuit. In particular, the speech recognizer 
generates frames of current input utterances based on 
received speech data and determines which of the generated 
frames are aligned with noisy states to produce a current 
noise model. The re-estimation circuit re-estimates the pro- 
duced current noise model by interpolating the number of 
frames in the current noise model with parameters from a 
previous noise model. The combiner circuit combines the 
parameters of the current noise model with model param- 
eters of a corresponding current clean speech model to 
generate model parameters of a composite noisy speech 
model. The classifier circuit determines a discrimination 
function by generating a weighted PMC HMM model. The 
discrimination learning circuit determines a distance func- 
tion by measuring the degree of mis-recognition based on 
the discrimination function, determines a loss function based 
on the distance function, which is approximately equal to the 
distance function, determines a risk function representing 
the mean value of the loss function, and generates a current 
discriminative noise model based in part on the risk 
function, such that the input utterances correspond more 
accurately with the predetermined model parameters of the 
composite noisy speech model. 

20 Claims, 2 Drawing Sheets 
ON-LINE BACKGROUND NOISE ADAPTATION OF PARALLEL MODEL COMBINATION HMM WITH DISCRIMINATIVE LEARNING USING WEIGHTED HMM FOR NOISY SPEECH RECOGNITION

FIELD OF THE INVENTION

The present invention relates to a speech recognition method, and, more particularly, relates to a two stage Hidden Markov Model (HMM) adaption method utilizing an "on-line" Parallel Model Combination (PMC) and a discriminative learning process to achieve accurate and robust results in real world applications without having to collect environment background noise in advance.

BACKGROUND OF THE INVENTION

Many electronic devices need to determine a "most likely" path of a received signal. For example, in speech, text, or handwriting recognition devices, a recognized unit (i.e., sound, syllable, letter, or word) of a received signal is determined by identifying the greatest probability that a particular sequence of states was received. This determination may be made by viewing the received signal as generated by a hidden Markov model (HMM). A discussion of Markov models and hidden Markov models is found in Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989. Also, this signal may be viewed as generated by a Markov model observed through a "noisy" process. This is discussed in Forney, "The Viterbi Algorithm", Proceedings of the IEEE, Vol. 61, No. 3, March 1973. The contents of these articles are incorporated herein by reference.

Briefly, a Markov model is a system which may be described as being in any one of a set of N distinct states (while in a hidden Markov model the states are unknown). At regularly spaced time intervals, the system makes a transition between states (or remains in the same state) according to a set of transition probabilities. A simple three state Markov model is illustrated in FIG. 1.

FIG. 1 shows a three state transition model 15. In this model, it is assumed that any state may follow any other state, including the same state repeated. For each state, there is a known probability indicating the likelihood that it will be followed by any other state. For example, in the English language, this probability may be statistically determined by determining how often each letter is followed by another letter (or itself). In this illustration, assume that state 1 [indicated as $S_1$] is the letter A, state 2 [indicated as $S_2$] is the letter B, and state 3 [indicated as $S_3$] is the letter C. Probabilities are assigned to the likelihood that any one of these letters will follow the same or another letter. In this example, an illustrative probability of 0.1 has been assigned to the likelihood that A will be followed by another A, 0.4 that A will be followed by a B, and 0.5 that A will be followed by a C. The same is done for the letters B and C, resulting in a total of nine probabilities. In this model, the state is apparent from the observation, that is, the state is either A, B, or C in the English language.

Often the states of the model generating the observations cannot be observed, but may only be ascertained by determining the probabilities that the observed states were generated by a particular model. For example, in the example of FIG. 1, assume that due to "noise", there is a known probability that in state A the symbol may be corrupted to appear to be a B, and a known probability that in state A the symbol will be corrupted to appear as a C. The same is true for B and C. To determine the best state sequence associated with the observations of this "noisy" state sequence, the text recognition device must determine, through probabilities, which letters are most likely to be in the sequence.

With respect to speech recognition, current technologies have produced fairly good results in recognizing speech in an ideal noiseless environment. However, when speech recognition is conducted in real-life environments, the results have been far less desirable. One of the main causes of this phenomenon is the interference of background noise in the environment. Since background noise may be considered additive in nature, one can either filter the noise from the signal source or compensate a recognition model by transferring the model parameters obtained through clean speech training data to the speech model having noise interference (as will be described below with reference to the conventional parallel model combination (PMC) approach). In other words, an approach is necessary that separates actual speech from background noise.

The current speech signal processing methods can be generally divided into three categories: 1) seeking robust features, known as discriminative measurement similarity, 2) speech enhancement, and 3) model compensation.

The first category, seeking robust features, compares the background noises with a known databank of noises so that the detected noises may be canceled out. However, this method is quite impractical since it is impossible to predict every noise, as noises can vary in different environment situations. Further, the similarity of different noises and noises having particular signal-to-noise ratios (SNR) also make this method inadequate.

The second category, speech enhancement, basically pre-processes the input speech signals, prior to the pattern matching stage, so as to increase the SNR. However, an enhanced signal-to-noise ratio does not necessarily increase the recognition rate, since the enhanced signals can still be distorted to some degree. For this reason, the methods of the speech enhancement category usually cannot deliver acceptable results.

The third category, model compensation, deals with recognition models. In particular, it compensates recognition models to adapt to the noisy environment. The most direct approach of this category is to separately collect the speech signals with the interference noise in the application environment and then train the recognition models. It is, however, difficult to accurately collect these kinds of training materials, thereby rendering this approach impractical. However, a recent model compensation method, parallel model combination (PMC), developed by Gales and Young, avoids the necessity to collect the training material in advance and is therefore very popular.

PMC assumes that speech to be recognized is modeled by a set of continuous density hidden Markov models (CDHMM) which have been trained using clean speech data. Similarly, the background noise can also be modeled using a single state CDHMM. Accordingly, speech that is interfered by additive noises can be composed of a clean speech model and a noise model. The parallel model combination is shown in FIG. 2.

In brief, the symbols of $\mu^c$ and $\Sigma^c$, discussed below, represent the mean vector and the covariance matrix, respectively, of any state output distribution in a cepstral domain. Cepstral parameters are derived from the log spectrum via a discrete cosine transform, which is represented by a
matrix C. Since the discrete cosine transform is linear, the corresponding mean vector and the covariance matrix in the log spectral domain (represented by $\mu^l$ and $\Sigma^l$, respectively) can be presented with the following equations:

$$\mu^l = C^{-1}\mu^c, \qquad \Sigma^l = C^{-1}\,\Sigma^c\,(C^{-1})^T \qquad (1)$$

If Gaussian distribution is assumed in both the cepstral and log spectral domains, then the mean vector and covariance matrix of the i-th component in the linear domain can be expressed as:

$$\mu_i = \exp\left(\mu_i^l + \Sigma_{ii}^l/2\right), \qquad \Sigma_{ij} = \mu_i\,\mu_j\left[\exp\left(\Sigma_{ij}^l\right) - 1\right] \qquad (2)$$

If the speech signal and the noise signal are assumed to be independent of each other and are additive in a linear domain, then the combined mean vector and the covariance matrix can be expressed as:

$$\hat{\mu} = g\,\mu + \tilde{\mu}, \qquad \hat{\Sigma} = g^2\,\Sigma + \tilde{\Sigma} \qquad (3)$$

where $(\mu, \Sigma)$ are the speech model parameters and $(\tilde{\mu}, \tilde{\Sigma})$ are the noise model parameters. The factor of g is a gain matching term introduced to account for the fact that the level of the original clean speech training data may be different from that of the noisy speech.

The above mean vector and covariance matrix may be expressed in the log spectral domain as:

$$\hat{\mu}_i^l = \log(\hat{\mu}_i) - \frac{1}{2}\log\left(\frac{\hat{\Sigma}_{ii}}{\hat{\mu}_i^2} + 1\right), \qquad \hat{\Sigma}_{ij}^l = \log\left(\frac{\hat{\Sigma}_{ij}}{\hat{\mu}_i\,\hat{\mu}_j} + 1\right) \qquad (4)$$

Further, when it is transformed back into the cepstral domain, the values of the mean vector and the covariance matrix can be expressed as:

$$\hat{\mu}^c = C\,\hat{\mu}^l, \qquad \hat{\Sigma}^c = C\,\hat{\Sigma}^l\,C^T \qquad (5)$$

Although the PMC method has been proven to be effective against additive noises (there is no need to collect noise interference signals in advance), it does require that the background noise signals be collected in advance to train the noise model. This noise model is then combined with the original recognition model, trained by the clean speech, to become the model that can recognize the environment background noise. As is evident in actual applications, noise changes with time so that the conventional PMC method cannot be used to process speech in a nonstationary environment. This is true since there can be a significant difference between the background noise previously collected and the background noise in the actual environment. For this reason, the conventional PMC is inadequate for processing noises in a nonstationary state.

It is therefore an object of the present invention to overcome the disadvantages of the prior art.

SUMMARY OF THE INVENTION

To overcome the above-mentioned limitations of the PMC method, the present invention discloses a two-stage hidden Markov model adaptation method.

The first stage comprises an on-line parallel model combination. The advantage of this on-line PMC method over the conventional PMC method lies mainly in its avoidance of the need to collect the background noise in advance. Instead, the background noise is filtered from the input noisy speech and is linearly combined with corresponding clean speech HMMs to form a robust composite HMM.

In addition, a discriminative learning method is incorporated in the second stage to increase the recognition rate of the system.

BRIEF DESCRIPTION OF THE DRAWINGS

The following detailed description, given by way of example and not intended to limit the present invention solely thereto, will best be understood in conjunction with the accompanying drawings, where similar elements will be represented by the same reference symbol, in which:

FIG. 1 illustrates a three state Markov model;

FIG. 2 shows a conventional parallel model combination (PMC) process; and

FIG. 3 shows a two stage Hidden Markov Model (HMM) adaption method in accordance with the present invention.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 3 schematically illustrates the two stage Hidden Markov Model (HMM) adaption method having a first stage "on-line" PMC 10 and a second stage discriminative learning process 20 in accordance with the present invention. Although the second stage 20 further improves the accuracy of the overall process (by resulting in a model closer to the model space of the testing data), the present invention is also applicable using only the first stage while still achieving stellar results.

The first stage on-line PMC 10 comprises a speech recognizer 11, a noise model re-estimation circuit 12, a clean speech HMM circuit 18, a PMC circuit 16, and a composite HMM circuit 14. "Noisy speech" data is sent to speech recognizer 11, which uses a Viterbi decoding scheme to determine frames of input utterances. In other words, the input utterances are recognized in speech recognizer 11 based on the testing data itself. Recognizer 11 further determines which frames of the recognized input utterances are aligned with noise states. The aligned sequence of frames is then extracted (as the current noise model) and sent to noise model re-estimation circuit 12.

The current noise model is re-estimated using an interpolation method, such as recursive ML (maximum likelihood) estimation. It is assumed that a previous noise model was obtained from n noise frames. Thus, let $\lambda(n)$ stand for the parameters estimated from the noise portions of the previous utterances. Next, let us assume that the current noise model contains k number of frames, which is represented by $\lambda(k)$. The re-estimated noise model, denoted by $\lambda(n+k)$, can be represented as an interpolation of $\lambda(n)$ and $\lambda(k)$ using the following equation:

$$\lambda(n+k) = \frac{n}{n+k}\,\lambda(n) + \frac{k}{n+k}\,\lambda(k) \qquad (6)$$

However, note that there need not be a previous noise model, i.e., n may be zero, such that the re-estimated noise model, $\lambda(n+k)$, may be determined based solely by k in the current noise model.
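The frame-count interpolation used for re-estimating the noise model can be illustrated with a minimal Python sketch. It assumes the noise model parameters are flattened into a list of floats; the function name `reestimate_noise_model` and this flat layout are illustrative assumptions and are not taken from the specification.

```python
from typing import List, Optional, Sequence


def reestimate_noise_model(
    prev_params: Optional[Sequence[float]],
    n: int,
    cur_params: Sequence[float],
    k: int,
) -> List[float]:
    """Interpolate a previous noise model (n frames) with a current one (k frames).

    Each parameter is weighted by its share of the n + k frames.
    The flat parameter layout is an assumption made for illustration.
    """
    if k <= 0:
        raise ValueError("the current noise model needs at least one frame")
    if n == 0 or prev_params is None:
        # No previous noise model: the result depends on the current frames only.
        return list(cur_params)
    if len(prev_params) != len(cur_params):
        raise ValueError("parameter vectors must have the same length")
    total = n + k
    return [(n * p + k * c) / total for p, c in zip(prev_params, cur_params)]


# Hypothetical values: 30 previous noise frames, 10 current noise frames.
updated = reestimate_noise_model([1.0, 2.0], 30, [3.0, 6.0], 10)
```

Because the weights are proportional to frame counts, a long history changes slowly, while a missing history (n equal to zero) lets the current utterance define the noise model on its own.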
The re-estimated current noise model parameters in the noise model re-estimation circuit 12 are then stored back to noise model re-estimation circuit 12. The re-estimated current noise model parameters in the noise model re-estimation circuit 12 are then linearly combined with the corresponding current clean speech model parameters (determined in the clean speech HMMs circuit 18) in the PMC circuit 16. Such combination, which yields the combined current speech model parameters, occurs in the linear spectral domain, as described with reference to FIG. 2.
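The linear-domain combination can be sketched as follows. This is a simplified illustration under stated assumptions, not the circuit's actual implementation: it treats each log spectral component independently (diagonal covariance), omits the cepstral transform by matrix C, and uses the standard log-normal relations between the log spectral and linear domains. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Gaussian = Tuple[float, float]  # (mean, variance) of one component


def log_to_linear(mu_l: float, var_l: float) -> Gaussian:
    """Map a log spectral Gaussian to linear-domain moments (log-normal assumption)."""
    mu = math.exp(mu_l + var_l / 2.0)
    var = mu * mu * (math.exp(var_l) - 1.0)
    return mu, var


def linear_to_log(mu: float, var: float) -> Gaussian:
    """Inverse mapping from linear-domain moments back to the log spectral domain."""
    var_l = math.log(var / (mu * mu) + 1.0)
    mu_l = math.log(mu) - 0.5 * var_l
    return mu_l, var_l


def pmc_combine(
    speech: Sequence[Gaussian], noise: Sequence[Gaussian], gain: float = 1.0
) -> List[Gaussian]:
    """Combine clean speech and noise components, assuming additive independent signals."""
    combined = []
    for (s_mu_l, s_var_l), (n_mu_l, n_var_l) in zip(speech, noise):
        s_mu, s_var = log_to_linear(s_mu_l, s_var_l)
        n_mu, n_var = log_to_linear(n_mu_l, n_var_l)
        mu_hat = gain * s_mu + n_mu                # means add in the linear domain
        var_hat = gain * gain * s_var + n_var      # variances add for independent signals
        combined.append(linear_to_log(mu_hat, var_hat))
    return combined


# Hypothetical single-component example in the log spectral domain.
composite = pmc_combine([(0.5, 0.2)], [(-1.0, 0.1)], gain=1.0)
```

A full implementation would also carry the off-diagonal covariance terms and apply the discrete cosine transform before and after these steps.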

The combined current speech model parameters, which will subsequently be referred to as the previous speech model parameters, are stored in the composite HMMs circuit 14 for subsequent use by speech recognizer 11.

The second stage discriminative learning process 20 comprises a classifier circuit 22, a discrimination learning circuit 24 and a weight HMM circuit 26. Basically, the discriminative learning process takes into account robustness issues by minimizing the error rate of the test data.

To minimize the error rate, classifier 22 defines a discrimination function in terms of a weighted HMM. The discrimination function, with respect to the j-th class, denoted by $g_j$, is given by the following equation:

$$g_j(O, S_j; \Lambda) = \sum_{i=1}^{K} w_{j,i} \cdot SC_{j,i} \qquad (7)$$

where $O = o_1, o_2, \ldots, o_T$ is the input feature vector of T number of frames, K is the total number of states, $SC_{j,i}$ represents the corresponding accumulated log probabilities of state i in class j, $\Lambda = \{w_{j,i}\}_{\forall j,i}$, and $w_{j,i}$ represents the corresponding weight of state i in class j.
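As a minimal sketch of the weighted discrimination function, the score of a class is the weighted sum of its per-state accumulated log probabilities. The function name `discriminant` and the numbers below are hypothetical and only illustrate the arithmetic.

```python
from typing import Sequence


def discriminant(weights_j: Sequence[float], state_scores_j: Sequence[float]) -> float:
    """Weighted sum over the K states of class j.

    weights_j[i] is the weight of state i in class j and state_scores_j[i]
    is the accumulated log probability of the frames aligned with state i.
    """
    if len(weights_j) != len(state_scores_j):
        raise ValueError("one weight is required per state")
    return sum(w * sc for w, sc in zip(weights_j, state_scores_j))


# Hypothetical accumulated log probabilities for a three-state class.
scores = [-12.0, -7.5, -3.5]
unweighted = discriminant([1.0, 1.0, 1.0], scores)  # plain HMM log score
weighted = discriminant([0.5, 2.0, 1.0], scores)    # state weights change the score
```

With all weights equal to one the function reduces to the ordinary accumulated log score, so the weights can be read as per-state corrections learned in the second stage.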

Based on the discrimination function g, a distance function, d, measuring the degree of mis-recognition between two competing class candidates $\alpha$ and $\beta$ is defined as follows:



(8) 



where $\alpha$ represents the top candidate and $\beta$ represents the next-to-top candidate.

It can be noted from this equation that a recognition error occurs (namely, when $\alpha$ and $\beta$ are switched) when $d_{\alpha\beta} < 0$. For each recognition error, a loss function can be defined as follows:



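$$l(d_{\alpha\beta}(O)) = \begin{cases} \tan^{-1}\left(\dfrac{d_{\alpha\beta}}{d_0}\right), & d_{\alpha\beta} < 0 \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$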



In the loss function, $d_0$ can be a relatively small positive value.

After the loss function is defined, a risk function, R, can be defined. The risk function determines the mean value of the loss function for N numbers of training speech data:

$$R(O; \Lambda) = \frac{1}{N}\sum_{k=1}^{N} l(d(O^k)) \qquad (10)$$

where $O = O^1, O^2, \ldots, O^N$, and $O^k$ represents the k-th training speech data. By taking the differential derivative, the current weighted parameter (indicated as $\Lambda_{l+1}$) at the l-th adjustment can be obtained using the following adaption equation:

$$\Lambda_{l+1} = \Lambda_l + \Delta\Lambda_l \ \text{ if } d(O) < \tau, \qquad \Delta\Lambda_l = -\epsilon(l)\,U\,\nabla R(O; \Lambda_l) \qquad (11)$$

where $\tau$ ($\tau > 0$) is a preset margin, $\epsilon(l)$ is the learning constant that is a decreasing function of l, and U is a positive-definite matrix, such as an identity matrix. The weight HMM circuit 26 could store the current weighted parameters adjusted by the discriminative learning circuit 24. Thereafter, the current weighted parameters would be changed to previous weighted parameters.
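The loss, risk, and weight adjustment steps can be sketched in Python as below. Several choices are assumptions made only for illustration: U is taken as the identity matrix, the learning constant is taken as `eps0 / (1 + l)` (one possible decreasing function of l), and the gradient of the risk with respect to the weights is supplied by the caller, since it depends on how the distance function is evaluated. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def loss(d: float, d0: float) -> float:
    """Loss for one utterance: arctangent of d / d0 when d < 0, zero otherwise."""
    if d0 <= 0.0:
        raise ValueError("d0 must be positive")
    return math.atan(d / d0) if d < 0.0 else 0.0


def risk(distances: Sequence[float], d0: float) -> float:
    """Mean of the loss over the distances of N training utterances."""
    if not distances:
        raise ValueError("at least one training utterance is required")
    return sum(loss(d, d0) for d in distances) / len(distances)


def learning_constant(l: int, eps0: float = 0.1) -> float:
    """Assumed schedule: a decreasing function of the adjustment index l."""
    return eps0 / (1.0 + l)


def update_weights(
    weights: Sequence[float],
    risk_gradient: Sequence[float],
    l: int,
    d: float,
    tau: float,
    eps0: float = 0.1,
) -> List[float]:
    """Apply one adjustment with U = identity, only when d is below the margin tau."""
    if d >= tau:
        return list(weights)  # outside the margin: weights are left unchanged
    eps = learning_constant(l, eps0)
    return [w - eps * g for w, g in zip(weights, risk_gradient)]


# Hypothetical values: one adjustment at l = 0 with a caller-supplied gradient.
new_weights = update_weights([1.0, 1.0], [0.5, -0.5], l=0, d=-0.2, tau=0.1)
```

Keeping the gradient as an input leaves the sketch independent of any particular form of the distance function.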

Accordingly, a two stage Hidden Markov Model (HMM) adaption method having a first stage "on-line" PMC 10 and a second stage discriminative learning process 20 has been described. The advantages of the first stage over conventional PMC processes include the fact that no pre-collection of noise is required and that testing utterances themselves are used for model composition, such that the inventive composite models are more robust against changes in environment noise. The advantage of the second stage is that it reduces the error rate to enhance the overall discrimination capability.

Finally, the above discussion is intended to be merely illustrative of the invention. Numerous alternative embodiments may be devised by those having ordinary skill in the art without departing from the spirit and scope of the following claims.
The claimed invention is: 

1. A method of generating a composite noisy speech 
model, comprising the steps of: 

generating frames of current input utterances based on 

received speech data, 
determining which of said generated frames are aligned 

with noisy states to produce a current noise model, 
re-estimating the produced current noise model by inter- 
polating the number of frames in said current noise 
model with parameters from a previous noise model, 
combining the parameters of said current noise model 
with templates of a corresponding current clean speech 
model to generate templates of a composite noisy 
speech model, 
determining a discrimination function by generating a 
weighted current noise model based on said composite 
noisy speech model, 
determining a distance function by measuring the degree 
of mis-recognition based on said discrimination 
function, 

determining a loss function based on said distance 
function, said loss function being approximately equal 
to said distance function, 
determining a risk function representing the mean value 

of said loss function, and 
generating a current discriminative noise model based in 
part on said risk function, such that the input utterances 
correspond more accurately with the predetermined 
templates of the composite noisy speech model. 

2. The method of claim 1, wherein said step of re-estimating being based on the equation:

$$\lambda(n+k) = \frac{n}{n+k}\,\lambda(n) + \frac{k}{n+k}\,\lambda(k),$$

where $\lambda(n)$ represents said parameters of said previous noise model, $\lambda(k)$ represents the parameters of frames of said current noise model, and $\lambda(n+k)$ represents said re-estimated current noise model.

3. The method of claim 2, wherein said generated frames aligned with noisy states are determined by a Viterbi decoding scheme.

4. The method of claim 3, wherein said combining the parameters of the re-estimated current noise model with parameters of a corresponding current clean speech model to generate a composite noisy speech model is done by using a method of parallel model combination.

5. The method of claim 4, wherein said discrimination function being:

$$g_j(O, S_j; \Lambda) = \sum_{i=1}^{K} \left(W_{j,i} \cdot SC_{j,i}\right)$$

where $O = o_1, o_2, \ldots, o_T$ represents an input feature vector of T number of frames, K is the total number of states, $SC_{j,i}$ represents the corresponding accumulated log probability of state i in class j, and $W_{j,i}$ represents the corresponding weight of state i in class j.

6. The method of claim 1, wherein the current parameter is generated by the steps of:

determining a distance function by measuring the degree of mis-recognition based on the discrimination function,

determining a loss function based on the distance function,

determining a risk function for representing the mean value of the loss function, and

generating the current weighted parameters based in part on the risk function.

7. The method of claim 6, wherein said distance function being:

where $W_\alpha$ represents a top weighted candidate and $W_\beta$ represents a next-to-top weighted candidate.

8. The method of claim 6, wherein said loss function being:

$$l(d_{\alpha\beta}(O)) = \tan^{-1}\frac{d_{\alpha\beta}}{d_0},\ d_{\alpha\beta} < 0;\quad 0,\ \text{otherwise}$$

where $d_0$ is a positive function.

9. The method of claim 6, wherein said risk function being:

$$R(O; \Lambda) = \frac{1}{N}\sum_{k=1}^{N} l(d(O^k)),$$

where $O = O^1, O^2, \ldots, O^N$, and $O^k$ represents a k-th training speech data.

10. The method of claim 9, wherein said current discriminative noise model being represented by:

$$\Lambda_{l+1} = \Lambda_l + \Delta\Lambda_l \ \text{ if } d(O) < \tau, \qquad \Delta\Lambda_l = -\epsilon(l)\,U\,\nabla R(O; \Lambda_l),$$

where $\tau$ ($\tau > 0$) is a preset margin, $\epsilon(l)$ is a learning constant that is a decreasing function of l, and U is a positive-definite matrix, such as an identity matrix.

11. A system for generating a composite noisy speech model, comprising:

a speech recognizer for generating frames of current input utterances based on received speech data, and for determining which of said generated frames are aligned with noisy states to produce a current noise model,

a re-estimation circuit for re-estimating the produced current noise model by interpolating the number of frames in said current noise model with parameters from a previous noise model,

a combiner circuit for combining the parameters of said current noise model with templates of a corresponding current clean speech model to generate templates of a composite noisy speech model,

a classifier circuit for determining a discrimination function by generating a weighted current noise model based on said composite noisy speech model, and

a discrimination learning circuit,

    for determining a distance function by measuring the degree of mis-recognition based on said discrimination function,

    for determining a loss function based on said distance function, said loss function being approximately equal to said distance function,

    for determining a risk function representing the mean value of said loss function, and

    for generating a current discriminative noise model based in part on said risk function, such that the input utterances correspond more accurately with the predetermined templates of the composite noisy speech model.

12. The system of claim 11, wherein said step of re-estimating being based on the equation:

$$\lambda(n+k) = \frac{n}{n+k}\,\lambda(n) + \frac{k}{n+k}\,\lambda(k),$$

where $\lambda(n)$ represents said parameters of said previous noise model, $\lambda(k)$ represents the parameters of frames of said current noise model, and $\lambda(n+k)$ represents said re-estimated current noise model.

13. The system of claim 12, wherein said generated frames aligned with noisy states are determined by a Viterbi decoding scheme.

14. The system of claim 13, wherein said combining the parameters of the re-estimated current noise model with parameters of a corresponding current clean speech model to generate a composite noisy speech model is done by using a method of parallel model combination.

15. The system of claim 11, wherein the current parameter is generated by the steps of:

determining a distance function by measuring the degree of mis-recognition based on the discrimination function,

determining a loss function based on the distance function,

determining a risk function for representing the mean value of the loss function, and

generating the current weighted parameters based in part on the risk function.

16. The system of claim 14, wherein said discrimination function being:

$$g_j(O, S_j; \Lambda) = \sum_{i=1}^{K} \left(W_{j,i} \cdot SC_{j,i}\right)$$

where $O = o_1, o_2, \ldots, o_T$ represents an input feature vector of T number of frames, K is the total number of states, $SC_{j,i}$ represents the corresponding accumulated log probability of state i in class j, and $W_{j,i}$ represents the corresponding weight of state i in class j.

17. The system of claim 15, wherein said distance function being:

where $W_\alpha$ represents a top weighted candidate and $W_\beta$ represents a next-to-top weighted candidate.

18. The system of claim 15, wherein said loss function being:

$$l(d_{\alpha\beta}(O)) = \tan^{-1}\frac{d_{\alpha\beta}}{d_0},\ d_{\alpha\beta} < 0;\quad 0,\ \text{otherwise}$$

where $d_0$ is a positive function.

19. The system of claim 15, wherein said risk function being:

$$R(O; \Lambda) = \frac{1}{N}\sum_{k=1}^{N} l(d(O^k)),$$

where $O = O^1, O^2, \ldots, O^N$, and $O^k$ represents a k-th training speech data.

20. The system of claim 19, wherein said current discriminative noise model being represented by:

$$\Lambda_{l+1} = \Lambda_l + \Delta\Lambda_l \ \text{ if } d(O) < \tau, \qquad \Delta\Lambda_l = -\epsilon(l)\,U\,\nabla R(O; \Lambda_l),$$

where $\tau$ ($\tau > 0$) is a preset margin, $\epsilon(l)$ is a learning constant that is a decreasing function of l, and U is a positive-definite matrix, such as an identity matrix.

* * * * *